gh-142183: Change data stack to use a resizable array by dpdani · Pull Request #148681 · python/cpython

dpdani · 2026-04-17T12:24:36Z

This PR changes the implementation of the Python stack to use a resizable array. This avoids the problem of calls that frequently cause the datastack_top (now called stack_top) pointer to switch between allocations.

After resizing, previous array allocations are not immediately freed because that would cause issues for various bits around the VM still pointing into them, and are instead freed along with the tstate.

During resizing, the previous contents of the stack are not copied into the new allocations, and instead the memory of the previous allocation is still used. Subsequently, popping and pushing frames, the new frames will always be residing on the new stack chunk allocation.

Overall it results in a ±1% performance change (within the noise range), but it avoids degenerate cases for any number of frames. I am also told it would allow further optimizations in the JIT.

Issue: Calls across stack chunks perform badly #142183

python-cla-bot · 2026-04-17T12:24:41Z

All commit authors signed the Contributor License Agreement.

dpdani · 2026-04-27T13:26:24Z

@pablogsal can you take a look this week? 🙏

pablogsal · 2026-04-27T13:27:38Z

@pablogsal can you take a look this week? 🙏

I can try but I have some other PRs first in my review queue :(

pablogsal

Thanks for working on this. I am worried about a couple of consequences that I think we should account for before continuing with this:

A minor concern is that this seems to change Python stack memory from “roughly current depth” usage to “high-water mark for the lifetime of the thread”. In the old chunked implementation, deep recursion allocated additional 16 KiB chunks and then released most of them while unwinding. In this version, resize_stack() keeps previous stack chunks linked from stack_chunk_list, and _PyThreadState_PopFrame() only moves stack_top; the chunks are not freed until the thread state is deleted.

This doesn't seem to be a lot so I am not too worried.

The worse concern I have I’m also concerned about profiler/debugger consequences. The old stack chunk layout allowed _remote_debugging/external unwinders to bulk-copy stack chunks cheaply. With this change, the active frame chain can span older chunks while only the newest chunk is copied in the new _remote_debugging path, so older frames fall back to individual remote reads. For a 1000-frame stack I measured:

old no-cache unwinding: 4 memory reads, ~1.2 KiB read
new no-cache unwinding: 966 memory reads, ~85.8 KiB read

Tachyon is probably mostly insulated because it uses frame caching, but first samples, cache-disabled paths, fallback paths, and external tools still care. Other profilers like austin currently hard-codes the old _PyStackChunk layout (previous, size, top, data), while this patch changes it to (size, previous, data), so those tools need explicit updates.

Given this and the potential gains I do not find the tradeoff very convincing....

bedevere-app · 2026-04-27T22:52:03Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

pablogsal · 2026-04-27T22:54:14Z

-        chunk_addr = GET_MEMBER(uintptr_t, chunks[count].local_copy, offsetof(_PyStackChunk, previous));
-        count++;
+    // Process this chunk
+    if (process_single_stack_chunk(unwinder, chunk_addr, &chunks[count]) < 0) {


Unnless I’m missing something, stopping after a single chunk here looks like a large perf regression for the profiler. The runtime still has a linked chunk chain via stack_chunk_list -> previous, but this now only copies the newest chunk. If the active frame chain spans older chunks, find_frame_in_chunks() misses those frames and we fall back to parse_frame_object(), which does one remote memory read per frame.

pablogsal · 2026-04-27T22:59:20Z

Unfortunately the more I think about it the less I like it: this model is harder to reason about than the previous linked-list model, and I think that matters because it creates a very easy footgun: the current chunk looks like the current stack backing store, but it is not.

Existing live frames may still be in older chunks, while newer frames are in the newest chunk at a matching logical offset. So any code that treats stack_chunk_list as “the stack” instead of “the head of a chain that may contain live frames” can silently become wrong. That already seems to happen here: copying only the head chunk misses older live frames and turns what used to be a cheap bulk-copy unwind into many per-frame remote reads.

The risk is not just external profilers. This makes future runtime/debugger code more fragile because pointer validity and frame ownership now depend on searching the whole chunk chain, not checking the current chunk.

markshannon · 2026-04-28T09:24:21Z

@pablogsal
I don't understand why this PR would cause 1000 remote reads. A 1000 frames stack will only span a few chunks, since they grow exponentially. For very large stacks, there will be fewer chunks to copy, not more.

Existing live frames may still be in older chunks, while newer frames are in the newest chunk at a matching logical offset.

There seems to be a misunderstanding here. No two frames will have the same offset.

This makes future runtime/debugger code more fragile because pointer validity and frame ownership now depend on searching the whole chunk chain, not checking the current chunk.

The stack is a linked list of frames, not chunks. Some of which (generator and coroutine frames) aren't in chunks at all, so tools already need to handle pointers outside of the current chunk.

Overall, I don't see how this really changes anything for an out-of-process profiler: Copy all the chunks, then traverse the stack.

Also, note that the lower, unused part of the current chunk in a stack with multiple chunks will be untouched, so a profiler should be able to detect when it needs to cross to another chunk.

Other profilers like austin currently hard-codes the old _PyStackChunk layout (previous, size, top, data), while this patch changes it to (size, previous, data), so those tools need explicit updates.

That is trade off that tools and libraries make: either they use stable API/ABIs, or they probe CPython internals. If they do the latter, they will need updating every release.

@P403n1x87 would this be a problem for you?

pablogsal · 2026-04-28T10:12:11Z

I don't understand why this PR would cause 1000 remote reads. A 1000 frames stack will only span a few chunks, since they grow exponentially. For very large stacks, there will be fewer chunks to copy, not more.

The issue I was pointing at is the current _remote_debugging implementation in this PR: copy_stack_chunks() now processes only the head chunk and sets out_chunks->count = 1. Once frame->previous points into an older chunk, find_frame_in_chunks() misses and we fall back to parse_frame_object(), which reads the frame remotely. That is where the ~966 remote reads for a 1000-frame stack came from.

The fix here is to either restore eager copying of the full stack_chunk_list -> previous chain using the new _PyStackChunk layout, or lazily copy previous chunks when the frame walk crosses out of the copied head chunk.

There seems to be a misunderstanding here. No two frames will have the same offset.

Also, you’re right that “same logical offset” was poor wording. I meant that the new chunk starts allocating at the old stack depth, leaving the lower part of the new chunk unused; not that two frames have the same offset.

pablogsal · 2026-04-28T10:15:41Z

Here is a repro:

import os
import subprocess
import sys
import _remote_debugging

DEPTH = 1000

child_code = f"""
import os, sys, time
sys.setrecursionlimit({DEPTH + 1000})

def f(n):
    if n == 0:
        print("READY", os.getpid(), flush=True)
        time.sleep(10)
        return
    f(n - 1)

f({DEPTH})
"""

p = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    text=True,
)

line = p.stdout.readline().strip()
pid = int(line.split()[1])

try:
    unwinder = _remote_debugging.RemoteUnwinder(
        pid,
        all_threads=True,
        cache_frames=False,
        stats=True,
        debug=True,
    )
    trace = unwinder.get_stack_trace()
    frames = sum(len(t.frame_info) for interp in trace for t in interp.threads)

    print("frames:", frames)
    print("stats:", unwinder.get_stats())
finally:
    p.terminate()
    p.wait()

In main we can see:

❯ ./python repro.py
frames: 1002
stats: {'total_samples': 1, 'frame_cache_hits': 0, 'frame_cache_misses': 0, 'frame_cache_partial_hits': 0, 'frames_read_from_cache': 0, 'frames_read_from_memory': 0, 'memory_reads': 4, 'memory_bytes_read': 1200, 'code_object_cache_hits': 1000, 'code_object_cache_misses': 2, 'stale_cache_invalidations': 0, 'frame_cache_hit_rate': 0.0, 'code_object_cache_hit_rate': 99.8003992015968}

with this PR:

frames: 1002
stats: {'total_samples': 1, 'frame_cache_hits': 0, 'frame_cache_misses': 0, 'frame_cache_partial_hits': 0, 'frames_read_from_cache': 0, 'frames_read_from_memory': 0, 'memory_reads': 966, 'memory_bytes_read': 85768, 'code_object_cache_hits': 1000, 'code_object_cache_misses': 2, 'stale_cache_invalidations': 0, 'frame_cache_hit_rate': 0.0, 'code_object_cache_hit_rate': 99.8003992015968}

Notice memory_reads going from 4 to 966

dpdani · 2026-04-28T10:16:04Z

The fix here is to either restore eager copying of the full stack_chunk_list -> previous chain using the new _PyStackChunk layout, or lazily copy previous chunks when the frame walk crosses out of the copied head chunk.

Yes, that's my bad. I made the smallest changes possible to the remote debugging module to get the PR working, but didn't inspect further improvements. Maybe it can be done in a follow-up PR by people more knowledgeable on the module? Or would you consider that a blocker?

pablogsal · 2026-04-28T10:16:32Z

Maybe it can be done in a follow-up PR by people more knowledgeable on the module? Or would you consider that a blocker?

This is a blocker. This PR adds a regression and that's not acceptable.

pablogsal · 2026-04-28T10:22:54Z

@dpdani I pushed a fix for the concrete _remote_debugging regression: it now walks tstate->stack_chunk_list through previous and copies all stack chunks before traversing frames, instead of only copying the newest chunk. With the fix, the 1000-recursive-frame repro goes from roughly ~966 remote reads / ~85 KB read to 3 reads.

I also added a regression test to make sure deep stacks are resolved from copied chunks rather than falling back to parsing frames individually from remote memory.

pablogsal · 2026-04-28T10:24:21Z

That said, I still think this solution is very confusing and too complex. It is harder to reason about where frames live, which chunks are relevant, and what invariants the implementation can rely on. I am worried this makes future changes easier to get subtly wrong.

pablogsal · 2026-04-28T10:55:09Z

I investigated an alternative implementation in #149097: instead of replacing the stack with the resizable-array model, it keeps the current chunked- stack invariants and extends the existing one-chunk cache into a small bounded per-thread cache.

I think this is a better direction because it fixes the allocator-thrashing issue without changing where live frames can reside, without changing _PyStackChunk/PyThreadState layout, and without requiring _remote_debugging or external unwinders to learn a new stack model. Active frames remain only in the datastack_chunk -> previous chain; cached chunks are detached and inactive.

The cache is intentionally bounded: currently up to 8 * _PY_DATA_STACK_CHUNK_SIZE, so the memory cost is predictable and much smaller than retaining a high-water-mark stack for the lifetime of the thread.

I cehcked the original repro and it no longer shows per-branch mmap/munmap churn.

pythongh-142183: Change data stack to use a resizable array

cfaca85

dpdani requested review from FFY00, ZeroIntensity, ericsnowcurrently, markshannon and pablogsal as code owners April 17, 2026 12:24

bedevere-app Bot added the awaiting review label Apr 17, 2026

bedevere-app Bot mentioned this pull request Apr 17, 2026

Calls across stack chunks perform badly #142183

Open

📜🤖 Added by blurb_it.

bb08739

pablogsal self-assigned this Apr 17, 2026

skip tests failing on emscripten

99bb502

dpdani requested review from ambv and lysnikolaou as code owners April 20, 2026 12:33

Merge branch 'main' into pythongh-142183-stack-resizable-array

daa182a

pablogsal requested changes Apr 27, 2026

View reviewed changes

bedevere-app Bot removed the awaiting review label Apr 27, 2026

bedevere-app Bot added the awaiting changes label Apr 27, 2026

pablogsal reviewed Apr 27, 2026

View reviewed changes

pablogsal force-pushed the gh-142183-stack-resizable-array branch from 8115b28 to 239e2eb Compare April 28, 2026 10:21

Fix regression in profiler module

99e9e44

pablogsal force-pushed the gh-142183-stack-resizable-array branch from 239e2eb to 99e9e44 Compare April 28, 2026 10:46

Uh oh!

Conversation

dpdani commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

python-cla-bot Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpdani commented Apr 27, 2026

Uh oh!

pablogsal commented Apr 27, 2026

Uh oh!

pablogsal left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bedevere-app Bot commented Apr 27, 2026

Uh oh!

pablogsal Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

pablogsal commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markshannon commented Apr 28, 2026

Uh oh!

pablogsal commented Apr 28, 2026

Uh oh!

pablogsal commented Apr 28, 2026

Uh oh!

dpdani commented Apr 28, 2026

Uh oh!

pablogsal commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented Apr 28, 2026

Uh oh!

pablogsal commented Apr 28, 2026

Uh oh!

pablogsal commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dpdani commented Apr 17, 2026 •

edited

Loading

python-cla-bot Bot commented Apr 17, 2026 •

edited

Loading

pablogsal left a comment •

edited

Loading

pablogsal commented Apr 27, 2026 •

edited

Loading

pablogsal commented Apr 28, 2026 •

edited

Loading

pablogsal commented Apr 28, 2026 •

edited

Loading